Introduction


NHL goalies are kind of an enigma. They’re notoriously hard to evaluate and predict in a sport that is already so chaotic. They are also incredibly important to their teams and often you will hear hockey analysts citing teams’ goalies as the reason that the team is preforming well or poorly.

We are interested specifically in evaluating NHL goalies for the purpose of understanding their contracts. How do teams decide how much to pay their goalies? We want to inspect goalies in the post-lockout seasons and take a look at the best, the worst and how much they get paid.

Objective


The objective of this report is to evaluate NHL goalies and their contracts in the post-lockout era and then build a model to try and predict how much goalies will get paid on their next contract.

NHL Goalies


We wanted to specifically preform this analysis during the post-lockout era (2013-present). Hockey has changed so much in these seasons, comparing goalies from post-lockout era hockey to pre-lockout doesn’t quite make sense.

Overview of Goalies

How many unique goalies played per season?

hr_data %>%
  distinct(player, szn) %>%
  count(szn, name = 'Number of goalies') %>%
  rename(Season = szn) %>%
  kable() %>%
  kable_styling(bootstrap_options = c('hover'), full_width = F)
Season Number of goalies
12-13 82
13-14 97
14-15 92
15-16 92
16-17 95
17-18 95
18-19 93

We see that ~95 goalies played each season. This may seem like a lot but this actually makes sense. From 2013-2017 there were 30 NHL teams in the league and from 2017-2019 there were 31. Each NHL team usually has a starting goalie, a back-up goalie, and a third-string “emergency” goalie. Teams calling-up goalies from their farm teams in emergency situations is also common. So you would (assuming that injuries are uniform) that number of teams in league * 3 is the approximate number of goalies who play each season.

In total there were 187 goalies that played at least 1 game in the NHL from 2013-2019. However, it makes no sense to evaluate a goalie based on their performance in just a few games because of the variation between events occuring in each hockey game. One of the many mysteries surrounding goalies is after how many games can we draw definitive conclusions about their quality of play? Let’s look at how many games goalies were playing per season and over their whole careers.

gp_szn <-
  ggplot(hr_data, aes(gp)) +
  geom_histogram(binwidth = 5, fill = '#a7d2cb', alpha = 0.8) +
  labs(x = 'Games played per season by goalies', y = 'Frequency') +
  theme(plot.caption = element_text(color = 'grey50'))

gp_overall <-
  hr_data %>%
  group_by(player) %>%
  summarise(gp = sum(gp)) %>%
  ggplot(aes(gp)) +
  geom_histogram(binwidth = 25, fill = 'thistle', alpha = 0.8) +
  labs(x = 'Games played overall by goalies', y = 'Frequency',
       caption = '1 Game played = 1 or more minute played on ice') +
  theme(plot.caption = element_text(color = 'grey50'))

gp_szn + gp_overall

In the case of this analysis, we were interested specifically in goalies who had played 40 or more games in the NHL from 2013-2019. Our reasoning behind this is that a full NHL season is 82 games so 40 games is about half a “season” played and seems like a reasonably large enough sample to evaluate a goalie’s play. The one downside to this evaluation is that it will devalue the contributions of back-up goalies, but back-up goalies usually don’t last as long as starters anyway and thus, their contracts tend to stay around league minimum.

Fact: In 2018-19, between goalies getting constantly blown out and injured, the Philadelphia Flyers were infamous for their goaltending trouble. They iced 8 goalies during the regular season, the highest of any team post-lockout.

read_csv('data/flyers_goaliedata_19.csv') %>%
  clean_names() %>%
  arrange(desc(gp)) %>%
  select(Player = player, Team = tm, `Games played` = gp, `Save %` = sv_percent) %>%
  kable() %>%
  kable_styling(bootstrap_options = c('condensed', 'hover'), full_width = F)
Player Team Games played Save %
Carter Hart PHI 31 0.917
Brian Elliott PHI 26 0.907
Anthony Stolarz PHI 12 0.902
Calvin Pickard PHI 11 0.863
Michal Neuvirth PHI 7 0.859
Cam Talbot PHI 4 0.881
Alex Lyon PHI 2 0.806
Mike McKenna PHI 1 0.833

Goalie demographics

It is well-known that most hockey players origniate from North America, Russia, or Scandinavia. Is this true for goalies specifically?

goalie_countries <- 
  cf_data %>%
    count(country, name = "num_of_goalies") %>%
    na.omit()

world <- 
  ne_countries(
    scale = "medium", 
    returnclass = "sf", 
    continent = c('North America', 'Europe')
  )

world %>%
  left_join(goalie_countries, by = c('name' = 'country')) %>%
  ggplot() +
  geom_sf(aes(fill = num_of_goalies)) +
  labs(fill = "Number of goalies") + 
  scale_fill_fish(option = "Prionace_glauca", direction = -1) +
  theme_void() +
  theme(legend.position = "bottom",
        legend.key.height = unit(2, 'mm'),
        legend.text = element_text(size = 7),
        legend.title = element_text(size = 8))

We can see that this holds true for goalies for the most part, but it’s interesting to note that there are goalies who originate from Denmark and even the UK.

The old hockey idea is that goalies are supposed to be big. The taller and wider they are, the more space they take up in the net. However, with the evolution of hockey skill came the need for more athletic goalies with lightning reflexes. The butterfly style, now commonplace in the NHL, requires goalies to be flexible. So what do goalies look like in the NHL now?

cf_data_player_height_weight <- 
  cf_data %>%
  select(player, weight, height, age, country) %>%
  mutate(
    weight = parse_number(sub(".*-","", weight)), 
    height = parse_number(sub(".*-","",height))
    ) %>%
  mutate(age = case_when(
    age < 20 ~ 'Under 20',
    age >= 20 & age < 25 ~ '20-24',
    age >= 25 & age < 30 ~ '25-29',
    age >= 30 & age < 35 ~ '30-34',
    age >= 35 & age < 40 ~ '35-39',
    age > 40 ~ 'Over 40',
  ))

p <-
  cf_data_player_height_weight %>%
  ggplot(aes(height, weight)) +
  geom_jitter(aes(color = age, label = player, label2 = country), 
              alpha = 0.65, size = 3) +
  labs(x = 'Goalies\'s Height (cm)', 
       y = 'Goalies\'s Weight (kg)', 
       color = 'Age') +
  scale_color_fish_d(option = 'Callanthias_australis')

ggplotly(p)

So we can see that the largest chunck of goalies seems to fall between 183 - 193cm (6’0 - 6’3 ft) in height and 85 - 95kg (187 - 210lbs) in weight. So only slightly taller and heavier than the average adult male, which is what you would expect from a professional athelete but is not quite what you would expect if you tend to think of goalies as big players.

The Best (and Worst) of the NHL

The top 10 averages by goalie

hr_data <- hr_data %>% mutate(w_percent = w/gp)

Salaries <- cf_data %>% 
  group_by(player) %>% 
  summarise(mean_aav = mean(aav)) %>% 
  mutate(player = replace(player,player == "Marc-André Fleury","Marc-Andre Fleury")) %>%
  mutate(player = replace(player,player == "Jaroslav Halák","Jaroslav Halak")) %>% 
  mutate(player = replace(player, player == "Eddie Läck","Eddie Lack")) %>% 
  mutate(player = replace(player, player == "Jacob Markström","Jacob Markstrom")) %>%
  mutate(player = replace(player, player == "Petr Mrázek","Petr Mrazek"))

Best_Goalies <- 
  hr_data %>% 
  filter(!(player == "Martin Jones" & szn == "14-15")) %>%
  group_by(player) %>% 
  summarise(gp = sum(gp),
            w = sum(w),
            ga = sum(ga),
            sa = sum(sa),
            sv = sum(sv),
            ) %>%
  filter(gp > 120) %>% 
  mutate(mean_w_percent = w/gp, 
         mean_sv_percent = sv/sa,
         mean_gaa = ga/gp,
         avg_sv = mean(mean_sv_percent),
         mean_gsaa = (sa * (1-avg_sv))- ga,
         )

Best_Goalies <- Best_Goalies %>% 
  left_join(Salaries, by = c("player" = "player"))

Mean_sv_plot <- 
  Best_Goalies %>% 
  top_n(10, mean_sv_percent) %>%
  ggplot(aes(x = mean_sv_percent, y = mean_aav)) + 
  geom_point(size = 2, color = '#a7d2cb') +
  geom_text_repel(aes(label = player), size = 3) +
  labs(x = "Save Percentage", y = "Average Annual Salary")

Mean_gaa_plot <- 
  Best_Goalies %>% 
  top_n(-10, mean_gaa) %>%
  ggplot(aes(x = mean_gaa, y = mean_aav)) + 
  geom_point(size = 2, color = '#f2d388') +
  geom_text_repel(aes(label = player), size = 3) +
  labs(x = "Goals Against Average", y = "Average Annual Salary")

mean_w_percent_plot <- 
  Best_Goalies %>% 
  top_n(10, mean_w_percent) %>% 
  ggplot(aes(x = mean_w_percent, y = mean_aav)) + 
  geom_point(size = 2, color = '#c98474') + 
  geom_text_repel(aes(label = player), size = 3) +
  labs(x = "Percentage of Games Won", y = "Average Annual Salary")

mean_gsaa_plot <- 
  Best_Goalies %>% 
  top_n(10, mean_gsaa) %>% 
  ggplot(aes(x = mean_gsaa,y = mean_aav))+ 
  geom_point(size = 2, color = '#874c62') + 
  geom_text_repel(aes(label = player), size = 3) +
  labs(x = "Goals Saved Above Average", y = "Average Annual Salary")

(Mean_sv_plot | Mean_gaa_plot) / (mean_w_percent_plot | mean_gsaa_plot)

Top goalies by season

# Finding the best goalies by Season
# creates tables for each stat we are looking at

top_goalie_sv <- function(season){
  top_goalie_stats <- hr_data %>% filter(szn == season, gp > 40) 
  top_goalie_sv_percent <- top_goalie_stats %>% top_n(1,sv_percent) %>%
    select(player, sv_percent, szn)
  return(top_goalie_sv_percent)
}

top_goalie_gaa <- function(season){
  top_goalie_stats <- hr_data %>% 
    filter(szn == season, gp > 40)
  goalie_gaa <- top_goalie_stats %>% 
    top_n(-1,gaa) %>% 
    select(player, gaa, szn)
  return(goalie_gaa)
}


top_goalie_w_percent <- function(season){
  top_goalie_stats <- hr_data %>% filter(szn == season, gp > 40)
  goalie_w_percent <- top_goalie_stats %>% top_n(1,w_percent) %>% 
    select(player, w_percent, szn)
  return(goalie_w_percent)
}


top_goalie_gsaa <- function(season){
  top_goalie_stats <- hr_data %>% filter(szn == season, gp > 40)
  goalie_gsaa <- top_goalie_stats %>% top_n(1,gsaa) %>% 
    select(player, gsaa, szn)
  return(goalie_gsaa)
}

#Loop to fill in the tables

best_sv_year <- tibble()
best_gaa_year <- tibble()
best_win_percent_year <- tibble()
best_gsaa_year <- tibble()
for (i in 1:7){
  best_sv_year <- bind_rows(best_sv_year,top_goalie_sv(hr_szns[i]))
  best_gaa_year <- bind_rows(best_gaa_year,top_goalie_gaa(hr_szns[i]))
  best_win_percent_year <- bind_rows(best_win_percent_year,top_goalie_w_percent(hr_szns[i]))
  best_gsaa_year <- bind_rows(best_gsaa_year,top_goalie_gsaa(hr_szns[i]))
}



#Plots to see best goalies by season 


sv_plot <- 
  best_sv_year %>% 
  ggplot(aes(x = szn, y = sv_percent)) + 
  geom_point(size = 2, color = '#a7d2cb') +
  geom_text_repel(aes(label = player), size = 3) +
  coord_flip() + 
  labs(x = "Season", y = "Save Percentage")

gaa_plot <- 
  best_gaa_year %>% 
  ggplot(aes(x =szn, y = gaa)) + 
  geom_point(size = 2, color = '#f2d388') +
  geom_text_repel(aes(label = player), size = 3) +
  coord_flip() + 
  labs(x = "Season", y = "Goals Against Average")

w_percent_plot <- 
  best_win_percent_year %>% 
  ggplot(aes(x = szn, y = w_percent)) + 
  geom_point(size = 2, color = '#c98474') + 
  geom_text_repel(aes(label = player), size = 3) +
  coord_flip() + 
  labs(x = "Season", y = "Win Percentage")

gsaa_plot <- 
  best_gsaa_year %>% 
  ggplot(aes(x = szn, y = gsaa)) + 
  geom_point(size = 2, color = '#874c62') + 
  geom_text_repel(aes(label = player), size = 3) +
  coord_flip() + 
  labs(x = "Season", y = "Goals Saved Above Average")

year_plots <- (sv_plot | gaa_plot) / (w_percent_plot | gsaa_plot)

year_plots 

# + plot_annotation(
#   title = "Top Goalies by Season"
# )

Worst Goalies over past seasons

#Worst Goalies by averages overall

worst_mean_sv_plot <- Best_Goalies %>% 
  top_n(-10,mean_sv_percent) %>%
  ggplot(aes(x = mean_sv_percent, y = mean_aav)) + 
  geom_point(size = 2, color = '#a7d2cb') +
  geom_text_repel(aes(label = player), size = 3) +
  labs(x = "Save Percentage", y = "Average Annual Salary")

worst_mean_gaa_plot <- Best_Goalies %>% 
  top_n(10,mean_gaa) %>%
  ggplot(aes(x = mean_gaa, y = mean_aav)) + 
  geom_point(size = 2, color = '#f2d388') +
  geom_text_repel(aes(label = player), size = 3) +
  labs(x = "Goals Against Average", y = "Average Annual Salary")

worst_mean_w_percent_plot <- Best_Goalies %>% 
  top_n(-10, mean_w_percent) %>% 
  ggplot(aes(x = mean_w_percent, y = mean_aav)) + 
  geom_point(size = 2, color = '#c98474') + 
  geom_text_repel(aes(label = player), size = 3) +
  labs(x = "Percentage of Games Won", y = "Average Annual Salary")


worst_mean_gsaa_plot <- Best_Goalies %>% 
  top_n(-10, mean_gsaa) %>% 
  ggplot(aes(x = mean_gsaa,y = mean_aav))+ 
  geom_point(size = 2, color = '#874c62') + 
  geom_text_repel(aes(label = player), size = 3) +
  labs(x = "Goals Saved Above Average", y = "Average Annual Salary")


Worst_average_plots <- (worst_mean_sv_plot | worst_mean_gaa_plot) / (worst_mean_w_percent_plot | worst_mean_gsaa_plot)

Worst_average_plots 

Worst goalies over last 6 seasons

# Some really bad goalies by year 

missing_salary <- 
  read_xls("data/MH_nhl_goalies_2017-2018.xls") %>% 
  clean_names() %>% 
  unite('player', c('first_name', 'last_name'), sep = ' ') %>% 
  filter(player == "Scott Darling") %>% 
  select(player, salary)

worst_goalies <- hr_data %>%
  filter(gp > 40 & sv_percent < 0.9)

worst_salaries <- cf_data %>% select(player, aav, szn)

worst_goalies <- worst_goalies %>% 
  left_join(worst_salaries,by = c("player" = "player", "szn" = "szn")) %>%
  left_join(missing_salary, by = c("player" = "player")) %>% 
  mutate(aav = replace_na(aav,0)) %>% 
  mutate(salary = replace_na(salary,0)) %>% 
  mutate(aav = aav+ salary) %>% 
  unite('player_szn', c('player','szn'), sep = ' | Season:', remove = FALSE) %>%
  select(-salary) %>%
  mutate(w_percent = w/gp)

wg_sv_percent <- worst_goalies %>% 
  ggplot(aes(x = sv_percent, y = aav)) + 
  geom_point(size = 2, color = '#a7d2cb') +
  geom_text_repel(aes(label = player), size = 3) +
  labs(x = "Save Percentage", y = "Annual Average Salary")

wg_gsaa <- worst_goalies %>% ggplot(aes(x = gsaa, y = aav)) + 
  geom_point(size = 2, color = '#f2d388') +
  geom_text_repel(aes(label = player), size = 3) +
  labs(x = "Goals Saved Above Average", y = "Annual Average Salary")

wg_gaa <- worst_goalies %>% ggplot(aes(x = gaa, y = aav)) + 
  geom_point(size = 2, color = '#c98474') + 
  geom_text_repel(aes(label = player), size = 3) +
  labs(x = "Goals Against Average", y = "Annual Average Salary")

wg_w_percent <- worst_goalies %>% 
  ggplot(aes(x = w_percent, y = aav)) + 
  geom_point(size = 2, color = '#874c62') + 
  geom_text_repel(aes(label = player), size = 3) +
  labs(x = "Win Percentage", y = "Annual Average Salary")

Worst_goalies_plot <- (wg_sv_percent | wg_gaa) / (wg_w_percent | wg_gsaa)

Worst_goalies_plot 

# + plot_annotation(
#   title = "Worst perfoming goalies and how much they get paid"
# )

First place team goalie performance

Red point is league average.

league_avgs <-
  hr_data %>%
  filter(gp > 20) %>%
  group_by(szn) %>%
  mutate(w_percent = w/gp) %>%
  summarise(
    avg_sv_percent = mean(sv_percent),
    avg_gaa = mean(gaa),
    avg_w_percent = mean(w_percent),
    avg_gsaa = mean(gsaa)
  )

starting_goalies_best <- 
  standings_data %>%
    filter(rk == 1) %>%
    select(-w, -l, -ol, -pts) %>%
    left_join(read_csv('data/NHLTeams.csv')) %>%
    left_join(hr_data, by = c('abbrev' = 'team', 'szn' = 'szn')) %>%
    group_by(team, szn) %>%
    top_n(1, gp) %>%
    mutate(w_percent = w/gp)

starting_goalies_sv_percent <- 
  ggplot() +
    geom_point(data = starting_goalies_best, 
               aes(x = szn, y = sv_percent, color = team), 
               show.legend = F, size = 4) +
    geom_point(data = league_avgs,
               aes(x = szn, y = avg_sv_percent),
               show.legend = F, size = 3, color = 'red3') +
    geom_text_repel(data = starting_goalies_best, 
                    aes(x = szn, y = sv_percent, label = player), 
                    size = 3) +
    labs(y = 'Save percent', x = 'Season') +
    ggtitle('Save percent') +
    scale_color_fish_d(option = "Callanthias_australis") +
    coord_flip() +
    theme_hc()

starting_goalies_gaa <- 
  ggplot() +
    geom_point(data = starting_goalies_best, 
               aes(x = szn, y = gaa, color = team), 
               show.legend = F, size = 4) +
    geom_point(data = league_avgs,
               aes(x = szn, y = avg_gaa),
               show.legend = F, size = 3, color = 'red3') +
    geom_text_repel(data = starting_goalies_best, 
                    aes(x = szn, y = gaa, label = player), 
                    size = 3) +
    labs(y = 'Goals against avg', x = NULL) +
    ggtitle('Goals against average') +
    scale_color_fish_d(option = "Callanthias_australis") +
    coord_flip()

starting_goalies_comb <- starting_goalies_sv_percent / starting_goalies_gaa
starting_goalies_comb

Goalie Contracts


Now, we will dive into goalie contracts. How much have goalies been getting paid in the last 6 seasons and how has that been changing? Have contract values been going up or down? Who gets paid the most and do they work hard for the money?

When analyzing goalie contract, most emphasis was put on analyzing the average annual value (AAV) of the contract (i.e. how much they make per year). The way contracts work in the NHL (and most professional sports with a salary cap) can be confusing. Some contracts are front-heavy (get paid the majority of the contract in the first few years), some are padded with bonuses from the signing teams, some have salary retention from previous teams. All of this is done to try and circumvent the salary cap, on a team management level. Thus, we feel it’s safe to analyze goalies using their AAV since using actual salary would be hard to understand, interpret, and every salary varies from goalie to goalie and team to team. At the end of the day, every goalie gets the full-value of their contract and its easy to understand and interepret as if they got paid an equal amount of their contract every season.

How much goalies get paid

So on average how much has an NHL goaltender, who has played more than 40 NHL games, gotten paid?

aav_gp_data <-
  cf_data %>%
  select(player, aav, szn) %>%
  left_join(
    cf_data %>%
      group_by(player) %>%
      summarise(gp = sum(gp)), 
    by = 'player'
  ) %>%
  filter(gp > 40)

aav_gp_data %>%
  mutate(szn = paste0(20, szn)) %>%
  ggplot(aes(szn, aav, color = szn, fill = szn)) +
  geom_violin(alpha = 0.8,
              show.legend = F, trim = T, width = 0.9) +
  geom_boxplot(width = 0.1, color = 'grey90', fill = 'grey50', alpha = 0.1) +
  labs(x = NULL, y = 'Average annual value ($)') +
  #ggtitle('Distribution of goalie average annual value') +
  scale_y_continuous(labels = scales::dollar) +
  fishualize::scale_color_fish_d(option = 'Scarus_globiceps') +
  fishualize::scale_fill_fish_d(option = 'Scarus_globiceps') +
  theme_classic() +
  theme(axis.line = element_blank(),
        axis.text.x = element_text(size = 12, color = 'grey25'),
        axis.text.y = element_text(size = 7, color = 'grey25'),
        axis.title.y = element_text(size = 9, color = 'grey25', face = 'italic'),
        plot.title = element_text(hjust = 0.5, size = 16, color = 'grey15', face= 'italic'))

Another way goalies are weird is that goalies in the past have not usually made as much as their skater teammates. This is starting to change however. For example, Carey Price signed a contract extension that took effect in the 18-19 season for $10.5M per year over 8 years, $84M in total by the time the contract is up (see above plot). This was an unprecedented signing at the time, but as teams start to value good, consistent goalies more and more, we are likely to see more high-valued goalie contracts. We can see from the plot that the median AAV has shifted in the past two seasons from $2M to closer to $3M. Also the AAV distribution is more spread out in the last couple seasons, instead of being clumped all near the lower end.

Contract inflation

  • How are goalie contracts inflating?
cf_data %>%
  group_by(exp_year) %>%
  summarise(med_aav = median(aav)) %>%
  ggplot(aes(exp_year, med_aav)) +
  geom_step(color = 'turquoise3', size = 1) +
  labs(x = 'Contract expiry year', y = 'Median average annual value of contract') +
  scale_y_continuous(labels = scales::dollar) +
  theme_classic() +
  theme(axis.line = element_blank())

Top paid players post-lockout

## Get top 10 paid players in league over last 7 years
top_10_paid <- 
  cf_data %>%
    group_by(player) %>%
    top_n(1, aav) %>%
    ungroup() %>%
    distinct(player, aav, team) %>%
    top_n(10, aav)

## Plot top 10 paid players
top_10_paid %>%
  mutate(team = substr(team, start = str_length(team) - 2, stop = str_length(team))) %>%
  mutate(player = paste0(player, '\n(', team, ')     ')) %>%
  ggplot() +
  geom_segment(aes(x = reorder(player, aav), xend = player, y = 0, yend = aav), color = 'grey50') +
  geom_point(aes(x = player, y = aav, color = team), size = 5, show.legend = F) +
  labs(x = NULL, y = 'Average annual value') +
  #ggtitle('Top 10 paid players in the league (as of 2018-19)') +
  coord_flip() +
  scale_color_fish_d(option = 'Callanthias_australis') +
  scale_y_continuous(labels = scales::dollar) +
  theme_classic() +
  theme(axis.line = element_blank(),
        axis.text.y = element_text(size = 10, color = 'grey25'),
        plot.title = element_text(hjust = 0.5, size = 16, color = 'grey25', face= 'italic'))

How Contracts Affect Play


  • How good are goalies that got paid a lot? (vice versa)

How Much Will Goalies Get Paid


  • Build multivariate regression model to predict contract average annual value
  • Make predictions and compare to actual contracts
## Prep and combine data

cf_reg_data <- 
  cf_data %>%
  mutate(
    ## Removes accents off names
    player = stri_trans_general(player, 'latin-ascii'),
    ## Transform all names to lowercase
    player = str_to_lower(player)) %>%
    ## Remove redundant cols
    select(-weight, -height, -pos, -team, -age, -gp, -w, -l, -so, -gaa, -sv_percent)

hr_reg_data <-
  hr_data %>%
  ## Transform all names to lowercase
  mutate(player = str_to_lower(player)) %>%
  # Filter for post lockout szns
  filter(szn != '12-13') %>%
  # Fix nicknames
  mutate(player = case_when(
    str_detect(player, 'cal heeter') ~ str_replace(player, 'cal heeter', 'calvin heeter'),
    str_detect(player, 'matt o\'connor') ~ str_replace(player, 'matt o\'connor', 'matthew o\'connor'),
    str_detect(player, 'eddie pasquale') ~ str_replace(player, 'eddie pasquale', 'edward pasquale'),
    TRUE ~ as.character(player)
  ))

mp_reg_data <- 
  mp_data %>%
  # Filter data for all situations (5v5, player-down, player-up)
  filter(situation == 'all', szn != '12-13') %>%
  # Remove redundant cols
  select(-player_id, -team, -season, -position, -games_played, 
         -penality_minutes, -penalties, -situation, -goals) %>%
  # Transform all names to lowercase
  mutate(name = str_to_lower(name)) %>%
  # Fix nicknames
  mutate(name = case_when(
    str_detect(name, 'tom mccollum') ~ str_replace(name, 'tom mccollum', 'thomas mccollum'),
    str_detect(name, 'j.f. berube') ~ str_replace(name, 'j.f. berube', 'jean-francois berube'),
    str_detect(name, 'j-f berube') ~ str_replace(name, 'j-f berube', 'jean-francois berube'),
    str_detect(name, 'cal petersen') ~ str_replace(name, 'cal petersen', 'calvin petersen'),
    TRUE ~ as.character(name)
  ))

# Join all data together
joined_data <-
  hr_reg_data %>%
  full_join(mp_reg_data, by = c('player' = 'name', 'szn' = 'szn')) %>%
  inner_join(cf_reg_data, by = c('player', 'szn')) %>%
  # Remove extra contract info (since we don't "know" this information yet)
  select(-rk, -cap_hit_percent, -salary, -length, -cap_hit) 

reg_data <- 
  joined_data %>%
  # Code binary variables
  mutate(handed = ifelse(handed == 'Left', 0, 1),
         expiry = ifelse(expiry == 'UFA', 0, 1)) %>%
  # Select numeric predictors
  select_if(is.numeric)

Feature selection

  • Way too many predictors
reg_data %>% ncol()
## [1] 60

Calculate information gain for predictors and grab 10 highest predictors in information gain

play_continued_in_zone x_on_goal unblocked_shot_attempts x_freeze min icetime low_danger_shots high_dangerx_goals gs play_continued_outside_zone aav
6 20.96 38 4.45 40 2400 17 0.68 1 11 900000
376 787.82 1498 179.44 1569 94166 870 19.83 24 397 1775000
726 1639.26 2844 383.84 3000 179926 1789 48.03 52 868 3187500
41 74.95 145 18.06 139 8023 79 1.78 3 23 625000
241 533.23 976 124.78 1094 65647 574 12.52 19 283 3416666
844 1820.57 3330 423.74 3084 184714 1964 42.03 49 879 2900000

Build regression model with selected features

## Model
mod <- lm(aav~., feat_selected)
summary(mod)
## 
## Call:
## lm(formula = aav ~ ., data = feat_selected)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3810987  -779564   -36764   584496  5612561 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 686544.59  110073.72   6.237 9.13e-10 ***
## play_continued_in_zone        1228.14    2697.81   0.455   0.6491    
## x_on_goal                     2156.58    5600.18   0.385   0.7003    
## unblocked_shot_attempts      -1198.09    1224.44  -0.978   0.3283    
## x_freeze                     20710.43   20921.61   0.990   0.3227    
## min                          -2924.02    5158.02  -0.567   0.5710    
## icetime                        -22.11      84.85  -0.261   0.7945    
## low_danger_shots             -5424.90    2577.60  -2.105   0.0358 *  
## high_dangerx_goals           -4782.75   19989.19  -0.239   0.8110    
## gs                          351068.82   49849.96   7.043 5.90e-12 ***
## play_continued_outside_zone   -806.60    2853.88  -0.283   0.7776    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1425000 on 529 degrees of freedom
## Multiple R-squared:  0.5562, Adjusted R-squared:  0.5478 
## F-statistic: 66.29 on 10 and 529 DF,  p-value: < 2.2e-16

Predictions

  • Sergei Bobrovsky: got paid $11M AAV when he signed this season
new_data1 <- 
  joined_data %>%
  filter(player == 'sergei bobrovsky', szn == '18-19') %>%
  select(subset_info_gain)

predict(mod, new_data1)
##       1 
## 5294396
  • Mikko Koskinen: got paid $4.5M AAV when he signed an extension
new_data2 <-
  joined_data %>%
  filter(player == 'mikko koskinen') %>%
  select(subset_info_gain)

predict(mod, new_data2)
##       1 
## 4391092

Conclusions


Data


  • HockeyReference
  • CapFriendly
  • MoneyPuckf
  • MetaHockey